suppress warning message of pandas_on_spark to_spark #1058

Merged
merged 2 commits into microsoft:main from suppress_warning on Jun 1, 2023

Conversation

thinkall
Collaborator

Why are these changes needed?

To suppress the warning message from the pandas-on-Spark `to_spark` conversion, shown below:

PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `to_spark`, the existing index is lost when converting to Spark DataFrame.

The warning message could appear many times during the AutoML/tuning process, which is annoying and unnecessary.
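The diff itself is not shown in this thread, but a minimal sketch of one way to silence only this advice warning around a conversion (the helper name `to_spark_df` is hypothetical, not necessarily the actual FLAML change) could look like:

import warnings

import pyspark.pandas as ps


def to_spark_df(psdf: ps.DataFrame):
    # Hypothetical helper: convert a pandas-on-Spark DataFrame to a
    # Spark DataFrame without emitting the index_col advice warning.
    with warnings.catch_warnings():
        # Match the start of the advice message quoted above.
        warnings.filterwarnings(
            "ignore",
            message="If `index_col` is not specified for `to_spark`",
        )
        return psdf.to_spark()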

Related issue number

Checks

@thinkall thinkall requested review from 0101, sonichi, levscaut and qingyun-wu and removed request for 0101 May 30, 2023 08:51
Collaborator

@levscaut levscaut left a comment


The suppression is validated in our customized build of FLAML; it greatly reduces redundant and unformatted output in the console.

Contributor

@sonichi sonichi left a comment


The solution is OK. Just a reminder that these functions are on the critical path and the import statement inside them can slow down the performance.

@thinkall
Collaborator Author

thinkall commented May 31, 2023

> The solution is OK. Just a reminder that these functions are on the critical path and the import statement inside them can slow down the performance.

Thanks for the reminder. Below is a simple performance test.

def test_import():
    # Mirrors the PR change: the import and filter run on every call.
    import warnings
    warnings.filterwarnings("ignore")

    y = 3 + 5
    x = y + 2

    return x, y


def test_no_import():
    # Same trivial work without the in-function import, as a baseline.
    y = 3 + 5
    x = y + 2

    return x, y


if __name__ == "__main__":
    import timeit

    # Total wall time for 1e6 calls of each variant.
    num_calls = int(1e6)
    print(timeit.timeit("test_import()", setup="from __main__ import test_import", number=num_calls))
    print(timeit.timeit("test_no_import()", setup="from __main__ import test_no_import", number=num_calls))

Results on my local machine:

0.6618426890054252   (test_import)
0.06726869300473481  (test_no_import)

Looks like the overhead is acceptable.

@sonichi
Contributor

sonichi commented Jun 1, 2023

> Looks like the overhead is acceptable.

For the non-Spark case, a 0.5 s overhead per call is considered big because the call happens many times in the inner loop, not just once.

@thinkall
Collaborator Author

thinkall commented Jun 1, 2023


> For the non-Spark case, a 0.5 s overhead per call is considered big because the call happens many times in the inner loop, not just once.

The overhead is for 1e6 calls, so it is well under a microsecond per call.
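For clarity, the per-call cost can be derived from the timings reported above (a small illustrative calculation, not part of the PR):

num_calls = int(1e6)
with_import = 0.6618426890054252      # total seconds for 1e6 calls of test_import
without_import = 0.06726869300473481  # total seconds for 1e6 calls of test_no_import

# Extra cost attributable to the in-function import and filter, per call.
per_call_overhead = (with_import - without_import) / num_calls
print(f"{per_call_overhead * 1e6:.3f} microseconds per call")  # ~0.595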

@thinkall thinkall requested a review from sonichi June 1, 2023 04:25
@sonichi sonichi added this pull request to the merge queue Jun 1, 2023
Merged via the queue into microsoft:main with commit d36b2af Jun 1, 2023
@thinkall thinkall deleted the suppress_warning branch June 1, 2023 23:27